Abstract. Regular expressions are a concise yet expressive language 
for expressing patterns. For instance, in networked software, they are 
used for input validation and intrusion detection. Yet some widely de- 
ployed regular expression matchers based on backtracking are themselves 
vulnerable to denial-of-service attacks, since their runtime can be expo- 
nential for certain input strings. This paper presents a static analysis for 
detecting such vulnerable regular expressions. The running time of the 
analysis compares favourably with tools based on fuzzing, that is, ran- 
domly generating inputs and measuring how long matching them takes. 
Unlike fuzzers, the analysis pinpoints the source of the vulnerability and 
generates possible malicious inputs for programmers to use in security 
testing. Moreover, the analysis has a firm theoretical foundation in ab- 
stract machines. Testing the analysis on two large repositories of regular 
expressions shows that the analysis is able to find significant numbers of 
vulnerable regular expressions in a matter of seconds. 

1 Introduction 

Regular expression matching is a ubiquitous technique for reading and validat- 
ing input, particularly in web software. While pattern matchers are among the 
standard techniques for defending against malicious input, they are themselves 
vulnerable. The root cause of the vulnerability is that widely deployed regular 
expression matchers, like the one in the Java libraries, are based on backtracking 
algorithms, rather than the construction of a Deterministic Finite Automaton 
(DFA), as used for lexers in compiler construction [11,2]. One reason for relying 
on backtracking rather than a DFA construction is to support a more expressive 
pattern specification language commonly referred to as "regexes". Constructs 
such as back-references supported by such regex languages go beyond regular 
and even context-free languages and are known to be computationally expensive [1].
However, even if restricted to purely regular constructs, backtracking
matchers may have a running time that is exponential in the size of the input,
potentially causing a regular expression denial-of-service (ReDoS) attack [17].
It is this potentially exponential runtime on pure regular expressions (without 
backreferences) that we are concerned about in this paper. Part of our motiva- 
tion is that, for purely regular expressions, the attack could be defended against 
by avoiding backtracking matchers and using more efficient techniques [7,24]
instead. 



For a minimalistic example, consider matching the regular expression
a** against the input string a...ab, with n repetitions of a. A backtracking
matcher takes an exponential time in n when trying to find a match; all 
matching attempts fail in the end due to the trailing b. For such vulnerable 
regular expressions, an attacker can craft an input of moderate size which causes 
the matcher to take so long that for all practical purposes the matcher fails to 
terminate, leading to a denial-of-service attack. Here we assume that the regular 
expression itself cannot be manipulated by the attacker but that it is matched 
against a string that is user-malleable. 

While the regular expression a** as above is contrived, one of the questions 
we set out to answer is how prevalent such vulnerable expressions are in the 
real world. As finding vulnerabilities manually in code is time consuming and 
error-prone, there is growing interest in automated tools for static analysis for 
security [12,5], motivating us to design an analysis for ReDoS.

Educating and warning programmers is crucial to defending against attacks 
on software. The standard coverage of regular expressions in the computer
science curriculum, covering DFAs in courses on computability or compiler
construction [2], is not necessarily sufficient to raise awareness about the
possibility of ReDoS. Our analysis constructs a series of attack strings, so that
developers can confirm the exponential runtime for themselves. 

This paper makes the following contributions: 

1. We present an efficient static analysis for DoS on pure regular expressions. 

2. The design of the tool has a firm theoretical foundation based on abstract
machines [18] and derivatives for regular expressions.

3. We report finding vulnerable regular expressions in the wild. 

In Section 2 we describe backtracking regular expression matchers as abstract
machines, so that we have a precise model of what it means for a matching
attempt to take an exponential number of steps. We build on the abstract
machine in designing our static analysis in Section 3, which we have implemented
in OCaml as described in Section 4. Experimental results in testing the analysis
on two large corpora of regular expressions are reported in Section 5. Finally,
Section 6 concludes with a discussion of related work and directions for further
research. The code of the tool and data sets are available at this URL:
http://www.cs.bham.ac.uk/~hxt/research/rxxr.shtml



2 Regular expression matching by backtracking 

This and the next section present the theoretical basis for our analysis. Readers 
primarily interested in the results may wish to skim them. 

We start with the following minimal syntax for regular expressions: 



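\[
\begin{array}{rcll}
e & ::= & \varepsilon & \text{Empty expression} \\
  & \mid & e_1 \mid e_2 & \text{Alternation} \\
  & \mid & e^* & \text{Kleene star} \\
  & \mid & e_1 \cdot e_2 & \text{Concatenation} \\
  & \mid & a & \text{Constant, where $a$ is an input symbol}
\end{array}
\]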



The $\cdot$ in concatenation $e_1 \cdot e_2$ is usually omitted, except when it is useful for
emphasis, as in a syntax tree. Following the usual parser construction methods [2],
we can define a parser which is capable of transforming (parsing) a given regular 
expression into an AST (abstract syntax tree) which complies with the above 
grammar. As an example, the AST constructed by such a parser for the regular 
expression (a | b)*c can be visualized in the following manner: 
Notice that we have employed a pointer notation to illustrate the AST structure;
this is quite natural given that in most programming languages, such an
AST would be defined using a similar pointer-based structure definition. Each
node of this AST corresponds to a unique sub-expression of the original regular
expression; the relationships among these nodes are given in the table to the
right. We have used the notation $\pi(p)$ to signify the dereferencing of the pointer
$p$ with respect to the heap $\pi$ in which the above AST is constructed. A formal
definition of $\pi$ was avoided in order to keep the notational clutter to a minimum;
interested readers may refer to [18] for a more precise definition of $\pi$.

Having parsed the regular expression into an AST, the next step is to con- 
struct an NFA structure that allows us to define a backtracking pattern matcher. 
While there are several standard NFA construction techniques, we opt for a
slightly different construction which greatly simplifies the rest of the discussion. 
The idea is to associate a continuation pointer cont with each of the nodes in the 
AST such that cont points to the following (continuation) expression for each 
of the sub-expressions in the AST. In other words, cont identifies the "next sub- 
expression" which must be matched after matching the given sub-expression. 
More formally, cont is defined as follows: 



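Definition 1. Let cont be a function

$\mathit{cont} : \mathrm{dom}(\pi) \to (\mathrm{dom}(\pi) \cup \{\mathit{null}\})$

such that:

- If $\pi(p) = (p_1 \mid p_2)$, then $\mathit{cont}\,p_1 = \mathit{cont}\,p$ and $\mathit{cont}\,p_2 = \mathit{cont}\,p$.

- If $\pi(p) = (p_1 \cdot p_2)$, then $\mathit{cont}\,p_1 = p_2$ and $\mathit{cont}\,p_2 = \mathit{cont}\,p$.

- If $\pi(p) = (p_1)^*$, then $\mathit{cont}\,p_1 = p$.

- $\mathit{cont}\,p_0 = \mathit{null}$, where $p_0$ is the pointer to the root of the AST.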
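
To make Definition 1 concrete, the following Python sketch computes cont for the expression (a | b)*c. The encoding is an assumption made purely for illustration: the heap is a dictionary PI from integer pointers to tagged tuples, None stands for null, and the names PI and compute_cont are hypothetical. The tool itself is written in OCaml and may represent the AST differently.

```python
# Heap pi for (a|b)*c, written by hand; pointers are small integers (an assumption).
PI = {
    0: ("cat", 1, 5),   # root: (a|b)* . c
    1: ("star", 2),     # (a|b)*
    2: ("alt", 3, 4),   # a | b
    3: ("sym", "a"),
    4: ("sym", "b"),
    5: ("sym", "c"),
}


def compute_cont(pi, root):
    """Return the cont map of Definition 1 as a dict (None stands for null)."""
    cont = {root: None}          # cont p0 = null for the root
    todo = [root]
    while todo:
        p = todo.pop()
        node = pi[p]
        if node[0] == "alt":     # both branches continue where p continues
            cont[node[1]] = cont[p]
            cont[node[2]] = cont[p]
            todo += [node[1], node[2]]
        elif node[0] == "cat":   # left part continues with the right part
            cont[node[1]] = node[2]
            cont[node[2]] = cont[p]
            todo += [node[1], node[2]]
        elif node[0] == "star":  # the body loops back to the star node
            cont[node[1]] = p
            todo.append(node[1])
    return cont


CONT = compute_cont(PI, 0)
for p in sorted(CONT):
    print(p, PI[p], "cont ->", CONT[p])
```

In this encoding the constants a and b both continue at the star node, which is the back edge that makes repeated iterations of the Kleene expression possible.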



The following example illustrates the NFA constructed this way for the regular 
expression (a | b)*c: 




Here the dashed arrows identify the cont pointer for each of the AST nodes. 
Readers familiar with Thompson's construction [24,2] will realize that the resulting
NFA is a slightly pessimized version of that resulting from Thompson's
algorithm. The reason for this pessimization is purely of presentational nature; it 
helps to visualize the NFA as an AST with an overlay of a cont pointer mesh so 
that the structure of the original regular expression is still available in the AST 
portion. Furthermore, this presentation allows the definitions and proofs to be 
presented in an inductive fashion with respect to the structure of the expressions. 

With the NFA defined, we present a simple non-deterministic regular expression
matcher in the form of an abstract machine called the PWπ machine:



Definition 2. A configuration of the PWπ machine consists of two components:

$(p ; w)$

The $p$ component represents the current sub-expression (similar to a code pointer)
while $w$ corresponds to the rest of the input string that remains to be matched.
The transitions of this machine are as follows:



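\[
\begin{array}{lcll}
(p ; w)  & \to & (p_1 ; w) & \text{if } \pi(p) = (p_1 \mid p_2) \\
(p ; w)  & \to & (p_2 ; w) & \text{if } \pi(p) = (p_1 \mid p_2) \\
(p ; w)  & \to & (q ; w)   & \text{if } \pi(p) = p_1^* \wedge \mathit{cont}\,p = q \\
(p ; w)  & \to & (p_1 ; w) & \text{if } \pi(p) = p_1^* \\
(p ; w)  & \to & (p_1 ; w) & \text{if } \pi(p) = (p_1 \cdot p_2) \\
(p ; aw) & \to & (q ; w)   & \text{if } \pi(p) = a \wedge \mathit{cont}\,p = q \\
(p ; w)  & \to & (q ; w)   & \text{if } \pi(p) = \varepsilon \wedge \mathit{cont}\,p = q
\end{array}
\]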



The initial state of the PWπ machine is $(p_0 ; w)$, where $p_0$ is the root of the AST
corresponding to the input expression and $w$ is the input string. The machine
may terminate in the state $(\mathit{null} ; w'')$, where it has matched the original regular
expression against some prefix $w'$ of the original input string $w$ such that
$w = w'w''$. Apart from this successful termination, the machine may also terminate if
it enters a configuration where none of the above transitions apply.

The PWπ machine searches for a matching prefix by non-deterministically making
a choice whenever it has to branch at alternation or Kleene nodes. While
this machine is not very useful in practice, it allows us to arrive at a precise
model for backtracking regular expression matchers. Backtracking matchers operate
by attempting all the possible search paths in order; this allows us to model
them with a stack of PWπ machines. We call the resulting machine the PWFπ
machine:

Definition 3. The PWFπ machine consists of a stack of PWπ machines. The
transitions of the PWFπ machine are given below:

\[
\frac{(p ; w) \to (q ; w')}{(p ; w) :: f \to (q ; w') :: f}
\qquad
\frac{(p ; w) \not\to}{(p ; w) :: f \to f}
\]

\[
\frac{(p ; w) \to (q_1 ; w) \qquad (p ; w) \to (q_2 ; w)}{(p ; w) :: f \to (q_1 ; w) :: (q_2 ; w) :: f}
\]

The initial state of the PWFπ machine is $[(p_0 ; w)]$. The machine may terminate
if one of the PWπ machines locates a match or if none of them succeeds
in finding a match. In the latter case the PWFπ machine has exhausted the entire
search space and determined that the input string cannot be matched by the
regular expression in question.

The PWFπ machine allows us to analyze backtracking regular expression matchers
at an abstract level without concerning ourselves with any implementation-specific
details. More importantly, it gives an accurate cost model of backtracking
matchers; the number of steps executed by the PWFπ machine corresponds
to the amount of work a backtracking matcher has to perform when searching
for a match. In the following sections we employ these ideas to develop and
implement our static analysis.
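
The cost model can be tried out with a small simulation. The Python sketch below is not the authors' OCaml implementation. It assumes a dictionary encoding of the heap (PI), a hand-written cont map (CONT), None for null, and a body-first order at Kleene nodes; the names pw_steps and pwf_run are hypothetical. It runs a stack of PW configurations on the expression (a | a)*c and counts the configurations expanded.

```python
# Heap and cont map for (a|a)*c, written by hand (illustrative encoding).
PI = {
    0: ("cat", 1, 5),
    1: ("star", 2),
    2: ("alt", 3, 4),
    3: ("sym", "a"),
    4: ("sym", "a"),
    5: ("sym", "c"),
}
CONT = {0: None, 1: 5, 2: 1, 3: 1, 4: 1, 5: None}


def pw_steps(pi, cont, p, w):
    """Successor configurations of (p ; w) under the PW transitions."""
    node = pi[p]
    kind = node[0]
    if kind == "alt":
        return [(node[1], w), (node[2], w)]
    if kind == "star":
        # Trying the body before leaving the loop is an assumption (greedy order).
        return [(node[1], w), (cont[p], w)]
    if kind == "cat":
        return [(node[1], w)]
    if kind == "sym":
        return [(cont[p], w[1:])] if w[:1] == node[1] else []
    return [(cont[p], w)]  # empty expression


def pwf_run(pi, cont, root, w):
    """Run the stack machine; return (matched, number of configurations expanded)."""
    stack = [(root, w)]          # top of the stack is the end of the list
    steps = 0
    while stack:
        p, rest = stack.pop()
        if p is None:            # reached null: some prefix of w was matched
            return True, steps
        steps += 1
        stack.extend(reversed(pw_steps(pi, cont, p, rest)))
    return False, steps          # search space exhausted


for n in range(1, 11):
    print(n, pwf_run(PI, CONT, 0, "a" * n))
```

For this expression no input consisting only of a's can match, so every branch is explored and the step count roughly doubles with each additional a. The sketch does not guard against Kleene bodies that match the empty string, for which a naive stack machine may loop.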

3 Static analysis for exponential blowup 

The problem we are aiming to solve is this: given a regular expression $e$,
represented as in Section 2, are there input strings $x$, $w$, and $z$, such that:

1. Reading $x$ takes the machine to a pointer $p_0$ that is the root of a Kleene star
expression.

2. Reading the input $w$ takes the machine from $p_0$ back to $p_0$, and in at least
two different ways, that is, along two different paths in the NFA.

3. Reading the input $z$ when starting from $p_0$ causes the match to fail.



Fig. 3.1. The search tree for $xwwz$



We call $x$ the prefix, $w$ the pumpable string (by analogy with pumping lemmas
in automata theory), and $z$ the failure suffix.

From these three strings, malicious inputs can be constructed: the $n$-th malicious
input is $xw^n z$. Figure 3.1 illustrates the search tree that a backtracking
matcher has to explore when w is pumped twice. Because w can be matched in 
two different ways, the tree branches every time a w is read from the input. All 
branches fail in the end due to the trailing z, so that the matcher must explore 
the whole tree. 
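
Constructing the series of malicious inputs from the three strings is mechanical. The following Python sketch illustrates it; the function name attack_strings and its parameters are hypothetical and not part of the tool's interface.

```python
def attack_strings(x, w, z, count):
    """Yield the inputs x w^n z for n = 1, ..., count."""
    for n in range(1, count + 1):
        yield x + w * n + z


# Example with an empty prefix, pumpable string "a" and failure suffix "b".
for s in attack_strings("", "a", "b", 4):
    print(repr(s))
```

Feeding successive inputs to a matcher and timing it lets a developer observe the growth directly, without having to guess how many repetitions are needed.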

To state the analysis more formally, we will need to define paths in the 
matcher. 

Definition 4. A path of pointers, $t : p \leadsto q$, is defined according to the following
inductive rules:

- For each pointer $p$, $[p] : p \leadsto p$ is a path (identity).

- If $t : p \leadsto q$ is a path and there exists a PWπ transition such that

  $(q ; w'w_1) \to (q' ; w_1)$

  then $t \cdot [q'] : p \leadsto q'$ is also a path.

Lemma 1. The path $t : p \leadsto q$ ($q \neq p$) exists if and only if a PWπ run exists
such that:

$(p ; ww') \to \cdots \to (q ; w')$

Lemma 1 associates a unique string $w$ with each path of pointers (the substring
matched by the corresponding PWπ run). However, note that the converse
does not hold: for a given input string there can be more than one PWπ run.
In fact, it is this property of paths that leads us
to the main theorem of this paper:



Theorem 1. For a given Kleene expression $p_0$ where $\pi(p_0) = p_1^*$, if at least
two paths exist such that $t_1 : p_1 \leadsto p_0$, $t_2 : p_1 \leadsto p_0$ and $t_1 \neq t_2$, then a
regular expression involving $p_0$ exhibits $\Omega(2^n)$ runtime on a backtracking regular
expression matcher for input strings of the form $xw^n z$, where $x$ is a sub-string
matching the prefix of $p_0$ and $z$ is such that $xw^n z$ fails to match the overall
expression.

While a formal proof of Theorem 1 is outside of the scope of this paper, we
sketch its proof with reference to Figure 3.1. The prefix $x$ causes the PWFπ
machine to advance into a state where it has to match $p_0$ against the remainder
of the input string. Each copy of $w$ can then be matched along two different
paths, which leads to the branching of the search tree. Finally, the
suffix $z$ at the end of the input causes each search path to fail, which in turn
forces the PWFπ machine to backtrack and explore the entire search tree before
concluding that a match cannot be found. For the complexity, note that each
additional pumping increases the size of the input by a constant (the length of
$w$) whereas it doubles the size of the binary subtree given by the $w$ branches,
as well as the number of failed attempts to match $z$ at the end. If there are
more than 2 ways to match the pumpable string, say $b$, then $b$ rather than 2
becomes the base of the exponent, but 2 is still a lower bound. The matching of
the prefix $x$ at the beginning contributes a constant to the runtime, which can
be disregarded relative to the exponential growth. Thus the lower bound for the
number of steps is exponential.
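
The doubling argument can also be written as a recurrence. This is a sketch under the assumptions of the theorem, not a formal proof; the symbol $S(n)$ is introduced here only for illustration and denotes the number of steps the PWFπ machine needs, starting at $p_0$, to reject the input $w^n z$.

```latex
% Each of the two distinct paths reading w leads back to p_0 with
% w^{n-1} z left to read, and both branches must be explored to failure:
\begin{align*}
S(0) &\ge 1, \\
S(n) &\ge 2\,S(n-1) \qquad (n \ge 1), \\
\text{hence}\quad S(n) &\ge 2^{n}.
\end{align*}
```

With $b$ distinct ways of matching $w$ the same reasoning gives $S(n) \ge b^{n}$.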

3.1 Generating the pumpable string 

The most important step in generating an attack string for a vulnerable regular
expression is to generate the pumpable string $w$ in $xw^n z$ (for some Kleene
sub-expression). In order to arrive at the machine for building the pumpable string,
we must first introduce several utility definitions. Note that in the remainder of
this discussion, $p_0$ refers to a Kleene expression such that $\pi(p_0) = p_1^*$.

Definition 5. For a given pointer $p$, the operation $\Box p$ (called evolve) is defined
as:

$\Box p = [\, q \mid \exists t.\ t : p \leadsto q \wedge \exists a.\ \pi(q) = a \,]$

Notice that the result of $\Box p$ is a list of pointers.

Definition 6. The function $\mathcal{D}_a(P)$ (called derive) is defined on a list of pointers
$P$ and an input symbol $a$ according to the following rules:

\[
\mathcal{D}_a(h :: t) =
\begin{cases}
\mathcal{D}_a(t) & \text{if } \pi(h) = b,\ b \neq a \\
q :: \mathcal{D}_a(t) & \text{if } \pi(h) = a \wedge \mathit{cont}\,h = q \\
\mathcal{D}_a(\Box h \cdot t) & \text{otherwise.}
\end{cases}
\]

The definition of $\mathcal{D}_a(P)$ is analogous to Brzozowski's derivatives of regular
expressions [4]. In essence, the analysis computes derivatives of a Kleene expression
in order to find two different matcher states for the same input string.



Definition 7. A wP frame is defined as a pair $(w, P)$ where $w$ is a string and
$P$ is a list of pointers. A non-deterministic transition relation is defined on wP
frames as follows:

$(w, P) \to (wa, \mathcal{D}_a(P))$
Definition 8. The HFπ machine has configurations of the following form:

$(H ; f)$

Here $H$ (history) represents a set of (sorted) pointer lists and $f$ is a list of wP
frames. A deterministic transition relation defines the behavior of this machine
as follows:

\[
\frac{(w, P) \to (wx_0, P_0) \quad \cdots \quad (w, P) \to (wx_n, P_n) \qquad \forall x_i.\ x_i \in \Sigma \wedge P_i \notin H}
     {(H ; (w, P) :: f) \to (H \cup \{P_0, \ldots, P_n\} ; f \cdot [(wx_0, P_0), \ldots, (wx_n, P_n)])}
\]

The initial configuration of the HFπ machine is $(\emptyset ; [(\varepsilon, [p_1])])$ and the machine
can terminate in either of the following two configurations:

$(H ; [\,])$

$(H ; (w, P) :: f)$ where $\exists p', p'' \in P.\ \exists t', t''.\ t' : p' \leadsto p_0 \wedge t'' : p'' \leadsto p_0$

In the former configuration the machine has determined the Kleene expression
in question to be non-vulnerable, while in the latter it has derived the pumpable
string $w$.
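
The search for a pumpable string can be sketched in Python as follows. This is a simplified illustration, not the authors' OCaml code, and it makes several assumptions that go beyond the definitions: evolve follows only transitions that consume no input and cuts cycles with a visited set; the alphabet is taken to be the constants occurring in the expression; empty pointer lists are dropped; the search stops as soon as $p_0$ occurs twice in a pointer list, which is a stricter test than the termination condition of Definition 8; and a hypothetical max_frames bound limits the work. All names are hypothetical.

```python
from collections import deque


def evolve(pi, cont, p, seen=()):
    """Constants reachable from p without reading input, one entry per path."""
    if p is None or p in seen:       # null, or a cycle through a nullable body
        return []
    node = pi[p]
    kind = node[0]
    if kind == "sym":
        return [p]
    seen = seen + (p,)
    if kind == "alt":
        nxt = [node[1], node[2]]
    elif kind == "star":
        nxt = [node[1], cont[p]]
    elif kind == "cat":
        nxt = [node[1]]
    else:                            # empty expression
        nxt = [cont[p]]
    return [q for n in nxt for q in evolve(pi, cont, n, seen)]


def derive(pi, cont, a, ptrs):
    """Pointers reached from the list ptrs after reading the symbol a."""
    out = []
    for h in ptrs:
        is_sym = h is not None and pi[h][0] == "sym"
        targets = [h] if is_sym else evolve(pi, cont, h)
        out += [cont[q] for q in targets if pi[q][1] == a]
    return out


def find_pumpable(pi, cont, p0, max_frames=10000):
    """Breadth-first search over wP frames; return a pumpable string or None."""
    alphabet = sorted({n[1] for n in pi.values() if n[0] == "sym"})
    history = set()
    frames = deque([("", [pi[p0][1]])])      # start at the body p1 of the star
    while frames and max_frames > 0:
        max_frames -= 1
        w, ptrs = frames.popleft()
        if ptrs.count(p0) >= 2:              # two distinct runs are back at p0
            return w
        for a in alphabet:
            nxt = [q for q in derive(pi, cont, a, ptrs) if q is not None]
            key = tuple(sorted(nxt))
            if nxt and key not in history:   # skip dead and already-seen lists
                history.add(key)
                frames.append((w + a, nxt))
    return None


# (a|a)*c: the star is pointer 1.
VULN_PI = {0: ("cat", 1, 5), 1: ("star", 2), 2: ("alt", 3, 4),
           3: ("sym", "a"), 4: ("sym", "a"), 5: ("sym", "c")}
VULN_CONT = {0: None, 1: 5, 2: 1, 3: 1, 4: 1, 5: None}

# a*c: the star is pointer 1.
SAFE_PI = {0: ("cat", 1, 3), 1: ("star", 2), 2: ("sym", "a"), 3: ("sym", "c")}
SAFE_CONT = {0: None, 1: 3, 2: 1, 3: None}

print(find_pumpable(VULN_PI, VULN_CONT, 1))
print(find_pumpable(SAFE_PI, SAFE_CONT, 1))
```

Keeping the pointer lists as lists with repetitions, rather than as sets, is what lets the search notice that the same string reaches the same pointer along two different paths.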

3.2 Generating the Prefix and the Suffix 

For a regular expression of the form $e_1\,(e_2^*)\,e_3$, apart from a pumpable string $w$,
we must also generate a prefix $x$ and a suffix $z$. The intention is that $x$
leads the matcher to the point where it has to match $e_2^*$, after which we can
pump many copies of $w$ to increase the search space of the matcher. However, a
successful exploit also needs a suffix $z$ which forces the matcher to fail and so to
traverse the entire search space.

Generating the prefix is quite straightforward since a depth-first search for
Kleene sub-expressions (on the AST) can be augmented such that a (minimal)
prefix is generated for each search path. On the other hand, the suffix generation
is more involved. One would think $z$ should be generated such that it fails to
match the continuation expression $e_3$ of the original expression, but this intuition
is flawed since there is a possibility that $z$ could be matched by $e_2$ itself (while
it was meant for $e_3$). Depending on $e_2$, it could be the case that no failure suffix
exists. One example we found is that $e_2$ ends in .*, so that it can match anything.
In other cases a failure suffix may exist, but depend in complicated ways on
$e_2$. We chose not to solve this problem in full generality, but rather to employ
heuristics that find failure suffixes for many practical expressions, as illustrated
in the results section.



4 Implementation of the static analysis 

We implemented the HFπ machine described in Section 3 using the OCaml
programming language. OCaml is well suited to programming abstract syntax, 
and hence a popular choice for writing static analyses. One of the major ob- 
stacles faced with the implementation is that in order to be able to analyze 
real-world regular expressions, it was necessary to build a sophisticated parser. 
In this regard, we decided to support the most common elements of the Perl / 
PCRE standards, as these seem to be the most commonly used (and adapted) 
syntaxes. It should be noted that the current implementation does not support 
back-references or look-around expressions due to their inherent complexity; it 
remains to be seen if the static analysis proposed in this work can be adapted 
to handle such "regexes". However, as explained earlier, exponential vulnerabilities
in pattern specifications are not necessarily dependent on the use of
back-references or other advanced constructs (although one would expect such 
constructs to further increase the search space of a backtracking matcher). A 
detailed description of the pattern specification syntax currently supported by 
the implementation has been included in the resources accompanying this paper. 

The implementation closely follows the description of the HFπ machine presented
in Section 3. The history component $H$ is implemented as a set of sorted
integer lists, where a single sorted integer list corresponds to the list of nodes
pointed to by the pointer list $P$ of a wP frame $(w, P)$. This representation allows
for quick elimination of looping wP frames. While the size of $H$ is potentially
exponential in the number of nodes of a given Kleene expression, for practical
regular expressions we found this size to be well within manageable levels (as
evidenced in the results section).

A matter not addressed in the current work is that the PWFπ machine (and
naive backtracking matching algorithms in general) can enter into infinite loops
for Kleene expressions when the enclosed sub-expression matches the empty
string (i.e. the sub-expression is nullable). Although a complete treatment of this
issue and its solution (implemented by most of the well known backtracking
matchers) is beyond the scope of this paper, it should be mentioned that a
similar problem occurs in the HFπ machine during the $\Box p$ operation. We have
incorporated a method for detecting and terminating such infinite loops into the
OCaml code for the $\Box p$ function so that it terminates in all cases.

5 Experimental results 

The analysis was tested on two corpora of regexes (Table 1). The first of these
was extracted from an online regex library called RegExLib [19], which is a
community-maintained regex archive; programmers from various disciplines sub- 
mit their solutions to various pattern matching tasks, so that other developers 
can reuse these expressions for their own pattern matching needs. The second 
corpus was extracted from the popular intrusion detection and prevention system 
Snort [23], which contains regex-based pattern matching rules for inspecting IP
packets across network boundaries. The contrasting purposes of these two corpora
allow us to get a better view of the seriousness of exponential vulnerabilities
in practical regular expressions.

Table 1. Experimental results with RegExLib and Snort

The regex archive for RegExLib was only available through the corresponding
website. Therefore, as the first step the expressions had to be scraped
from their web source and adapted so that they can be fed into our tool. These
adaptations include removing unnecessary white-space, comments and spurious
line breaks. A detailed description of these adjustments as well as copies of both
adjusted and un-adjusted data sets have been included with the resources
accompanying this paper (also including the Python script used for scraping). The
regexes for Snort, on the other hand, are embedded within plain text files that 
define the Snort rule set. A Python script (also included in the accompanying 
resources) allowed the extraction of these regexes, and no further processing was 
necessary. 

The results of the HFπ static analysis on these two corpora of regexes are
presented in Table 1. The figures show that we can process around 75% of each
of the corpora with the current level of syntax support. Out of these analyzable 
amounts, it is notable that regular expressions from the RegExLib archive use 
the Kleene operator more frequently (about 50% of the analyzable expressions) 
than those from the Snort rule set (close to 30%). About 11.5% of the Kleene- 
based RegExLib expressions were found to have a pumpable Kleene expression 
as well as a suitable suffix, whereas for Snort this figure stands around 0.55%. 

The vulnerabilities reported range from trivial programming errors to more
complicated cases. For example, the following regular expression is meant to
validate time values in 24-hour format (from RegExLib):

^(([01][0-9]|[012][0-3]):([0-5][0-9]))*$

Here the author has mistakenly used the Kleene operator instead of the ? operator
to suggest the presence or non-presence of the value. This pattern works
perfectly for all intended inputs. However, our analysis reports that this expression
is vulnerable with the pumpable string "13:59" and the suffix "/". This
result gives the programmer a warning that the regular expression presents a
DoS security risk if exposed to user-malleable input strings to match.
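
The reported strings can be checked against an actual backtracking matcher. The sketch below uses Python's re module as a stand-in (an assumption; the paper's own spot checks used the Java matcher), with the expression transcribed from the example above. It confirms that the hour "13" is matched by both alternatives, which is the source of the ambiguity, and that a short attack string is rejected.

```python
import re

# The time-validation expression from the example, as transcribed here.
TIME = re.compile(r"^(([01][0-9]|[012][0-3]):([0-5][0-9]))*$")

# "13" is accepted by both branches of the alternation, so every copy of
# "13:59" can be matched in two different ways.
first = re.fullmatch(r"[01][0-9]", "13")
second = re.fullmatch(r"[012][0-3]", "13")
print(bool(first), bool(second))


def attack(n):
    """Pumpable string "13:59" repeated n times, followed by the suffix "/"."""
    return "13:59" * n + "/"


print(bool(TIME.match("13:59")))     # an intended input
print(TIME.match(attack(10)))        # a small attack string fails to match
```

The repetition count is kept small on purpose. On a backtracking engine the rejection time is expected to grow roughly exponentially with n, so larger values should be tried with care.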

For a moderately complicated example, consider the following regular expression
(again from RegExLib):

^([a-zA-z]:((\\([-*\.*\w+\s+\d+]+)|(\w+)\\)+)(\w+.zip)|(\w+.ZIP))$

This expression is meant to validate file paths to zip archives. Our tool identifies
this expression as vulnerable and generates the prefix "z:\", the pumpable
string "\zzz\" and the empty string as the suffix. This is probably an unexpected
input in the author's eye, and this is another way in which our tool can be
useful: it can point out potential misinterpretations which may have
materialized as vulnerabilities.

It is worth noting that the HFπ machine manages to classify both the corpora
(the analyzable portions) in a matter of seconds on modest hardware. This shows
that our static analysis is usable for most practical purposes, with the average
classification time for an expression in the range of microseconds. The two
extreme cases for which the machine took several seconds for the classification
are given below (only the respective Kleene expressions):

([\d\w][-\d\w]{0,253}[\d\w]\.)+

([^\x00]{0,255}\x00)*

Here the counting expressions [-\d\w]{0,253} and [^\x00]{0,255} were expanded
out during the parsing phase. The expansion produces a large Kleene
expression, which naturally requires more analysis during the HFπ simulation.
However, it should be noted that such expressions are the exception rather than
the norm.

Finally, it should be mentioned that all the vulnerabilities reported above
were individually verified using a modified version of the PWFπ machine (which
counts the number of steps taken for a particular matching operation). A sample
of those vulnerabilities was also tested on the Java regular expression matcher.

6 Conclusions 

We have presented a static analysis to help programmers defend against regular 
expression DoS attacks. Large numbers of regular expressions can be analysed 
quickly, and developers are given feedback on where in their regular expressions 
the problem has been identified as well as examples of malicious input. 

As illustrated in Section 5, the prefix, pumpable string and failure suffix can
be quite short. If their lengths are, say, 3, 5 and 0 characters, then an attacker
only needs to spend a very small amount of effort in providing a malicious
input of length 3 + 5 × 100 characters to cause a matching time in excess of $2^{100}$
steps. Even if a matching step takes only a nanosecond, such a running time
takes, for all intents and purposes, forever. The attacker can still scale up the
attack by pumping a few times more and thereby correspondingly multiplying
the matching time.
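
A back-of-the-envelope calculation makes the scale of this estimate tangible. The Python sketch below assumes a prefix of 3 characters, a pumpable string of 5 characters pumped 100 times, an empty suffix, and one nanosecond per matching step; the variable names are illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

input_length = 3 + 5 * 100 + 0        # prefix + 100 copies of w + suffix
steps = 2 ** 100                      # lower bound on the matching steps
seconds = steps * 1e-9                # at one nanosecond per step
years = seconds / SECONDS_PER_YEAR

print(input_length, "characters of input")
print(f"{years:.2e} years of matching time")
```

An input of a few hundred characters thus corresponds to a running time many orders of magnitude beyond anything a server could complete.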

The fact that the complexity of checking a regular expression for exponential 
runtime may be computationally expensive in the worst case does not necessarily 
imply that such an analysis is futile. Type checking in functional languages like
ML and Haskell also has high complexity [14,21], yet works efficiently in practice
because the worst cases rarely occur in real-world code. There are even program
analyses for undecidable problems like termination [3], so that the worst-case
running time is infinite; what matters is that the analysis produces results in 
enough cases to be useful in practice. It is a common situation in program 
analysis that tools are not infallible (having false positives and negatives), but 
they are nonetheless useful for identifying points in code that need attention by 
a human expert [9]. 

6.1 Related work 

A general class of DoS attacks based on algorithmic complexities has been ex- 
plored in [5]. In particular, the exponential runtime behavior of backtracking 
regular expression matchers has been discussed in |Q and [5D]. The seriousness 
of this issue is further expounded in [35] and [TB] where the authors demonstrate 
the mounting of DoS attacks on an IDS/IPS system (Snort) by exploiting the 
said vulnerability. The solutions proposed in these two works involve modifying 
the regular expressions and/or the matching algorithm in order to circumvent 
the problem in the context of IDS/IPS systems. We consider our work to be 
quite orthogonal and more general since it is based on a compile-time static 
analysis of regular expressions. However, it should be noted that both of those 
works concern regexes with back-references, a feature we are yet to
explore (matching with back-references is known to be NP-hard [1]).

While the problem of ReDoS has been known for at least a decade, we are not 
aware of any previous static analysis for defending against it. A handful of tools 
exist that can assist programmers in finding such vulnerable regexes. Among 
these tools we found Microsoft's SDL Regex Fuzzer [T^ and the RegexBuddy [T3] 
to be the most usable implementations, as other tools were too unstable to be 
tested with complex expressions. 

While RegexBuddy itself is not security-oriented software, it offers a debug
mode, which can be used to detect what the authors of the tool refer to as
Catastrophic Backtracking [10]. Even though such visual debugging methods
can assist in detecting potential vulnerabilities, they are only effective if the
attack string is known in advance; this is where a static analysis method like
the one presented in this paper has a clear advantage.

SDL Fuzzer, on the other hand, is aimed specifically at analyzing regular 
expression vulnerabilities. While details of the tool's internal workings are not 
publicly available, analyzing the associated documentation reveals that it operates
by fuzzing, i.e., by brute-forcing a sequence of generated strings through the
regular expression in question to detect long running times. The main disadvantage
of this tool is that it can take a very long time for the tool to classify a
given expression. Tests using some of the regular expressions used in the results
section above revealed that it can take up to four minutes for the Fuzzer to
classify certain expressions. It is an inherent limitation of fuzzers for exponential
runtime DoS attacks that finding out whether something takes a long time
by running it takes a long time. By contrast, our analysis statically analyzes
an expression without ever running it. It is capable of classifying thousands of 
regular expressions in a matter of seconds. Furthermore, the output produced 
by the SDL Fuzzer only reports the fact that the expression in question failed 
to execute within a given time limit for some input string. Using this generated 
input string to pin-point the exact problem in the expression would be quite 
a daunting task. In contrast, our static analysis pin-points the exact Kleene 
expression that causes the vulnerability and allows programmers to test their 
matchers with a sequence of malicious inputs. 

6.2 Directions for further research 

In further work, we aim to broaden the coverage of our tool to include more 
regexes. Given its basis in our earlier work on abstract machines [18] and
derivatives [1], we aim for a formal proof of the correctness of our analysis. We intend
to release the source code of the tool as an open source project. More broadly, 
we hope that raising awareness of the dangers of backtracking matchers will help 
in the adoption of superior techniques for regular expression matching [7,24,18].